Choosing LSI Dimensions by Document Linear Association Analysis

Authors

  • Chenggen Shi
  • Jie Lu
Abstract

Latent Semantic Indexing (LSI) has proven to be a valuable analysis tool with a wide range of applications. However, the crucial question of how to choose an appropriate number of dimensions for LSI remains unsolved. This paper describes a new method that addresses this problem. It finds that the sum of the dot products between all document vectors reaches its maximum value at a specific number of dimensions for a given dataset, and that LSI achieves its best performance with this reduced number of dimensions. Performance evaluations demonstrate that the method can choose an appropriate number of dimensions for LSI and effectively detect the data structure of a dataset.
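
The quantity named in the abstract, the sum of dot products between all document vectors at a given number of LSI dimensions, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say how document vectors are weighted, scaled, or normalized, nor whether self-pairs are counted, and those choices determine the shape of the resulting curve. The sketch assumes document vectors are the rows of V_k Σ_k from a truncated SVD of a term-by-document matrix and that all ordered pairs (including each document with itself) are summed. The function name `dot_product_curve` and the toy matrix are hypothetical.

```python
import numpy as np


def dot_product_curve(term_doc, max_k=None):
    """Map each k to the sum of dot products between k-dimensional document vectors.

    Assumptions (not taken from the paper): document vectors are the rows of
    V_k * Sigma_k, and the sum runs over all ordered pairs, self-pairs included.
    """
    # Thin SVD: term_doc = U @ diag(s) @ Vt
    _, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    max_k = len(s) if max_k is None else min(max_k, len(s))

    curve = {}
    for k in range(1, max_k + 1):
        # Rows are documents, columns are the first k latent dimensions.
        docs = Vt[:k].T * s[:k]
        # Gram matrix holds every pairwise dot product; sum all entries.
        curve[k] = float((docs @ docs.T).sum())
    return curve


# Hypothetical toy term-by-document matrix (5 terms, 4 documents).
A = np.array(
    [
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [1, 0, 0, 1],
    ],
    dtype=float,
)

curve = dot_product_curve(A)
for k, total in curve.items():
    print(k, round(total, 4))
```

Under the paper's method, the candidate number of dimensions would be the k at which such a curve peaks; reproducing that behaviour requires the document vector construction defined in the full text, which is not available here.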


Similar resources

A Text Mining Model by Using Weighting Technology

Latent Semantic Indexing (LSI) has been proven to be a valuable analysis tool with a wide range of applications. However, choosing an appropriate number of dimensions for LSI is still a crucial challenge. This paper provides a document vector model that uses weighting technology to deal with this problem. Our experimental results have demonstrated that this model can detect a dataset structu...


LSI vs. Wordnet Ontology in Dimension Reduction for Information Retrieval

In the area of information retrieval, the dimension of document vectors plays an important role. Firstly, with higher dimensions index structures suffer the “curse of dimensionality” and their efficiency rapidly decreases. Secondly, we may not use exact words when looking for a document, thus we miss some relevant documents. LSI (Latent Semantic Indexing) is a numerical method, which discovers ...


A probabilistic model for Latent Semantic Indexing

Dimension reduction methods, such as Latent Semantic Indexing (LSI), when applied to semantic space built upon text collections, improve information retrieval, information filtering and word sense disambiguation. A new dual probability model based on the similarity concepts is introduced to provide deeper understanding of LSI. Semantic associations can be quantitatively characterized by their s...


Using Linear Algebra for Intelligent Information Retrieval

Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users' requests and those in or assigned to documents in a database. Because of the tremendous diversity in the words people use to describe the same document, lexical methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD), one ca...


Linear Discriminant Analysis in Document Classification

Document representation using the bag-of-words approach may require bringing the dimensionality of the representation down in order to be able to make effective use of various statistical classification methods. Latent Semantic Indexing (LSI) is one such method that is based on eigendecomposition of the covariance of the document-term matrix. Another often used approach is to select a small num...




Publication date: 2003